Network motifs come in sets: correlations in the randomization process 
The identification of motifs (subgraphs that appear significantly more often in a particular network
than in an ensemble of randomized networks) has become a ubiquitous method for uncovering
potentially important subunits within networks drawn from a wide variety of fields. We find that 
the most common algorithms used to generate the ensemble from the real network change subgraph 
counts in a highly correlated manner, so that one subgraph's status as a motif may not be indepen- 
dent from the statuses of the other subgraphs. We demonstrate this effect for the problem of 3- and 
4-node motif identification in the transcriptional regulatory networks of E. coli and S. cerevisiae 
in which randomized networks are generated via an edge-swapping algorithm (Milo et al., Science 
298:824, 2002). We show that correlations among 3-node subgraphs are easily interpreted, and we 
present an information-theoretic tool that may be used to identify correlations among subgraphs of 
any size. 



Identifying motifs has become a standard way to probe 
the functional significance of biological, technological, 
and sociological networks [1, 2, 3, 4, 5, 6, 7]. A motif
is commonly defined as a subgraph whose number of appearances
in a particular network is significantly greater
than its average number of appearances in an ensemble of
networks generated under some null model [7]. The typical
null model prescribes an algorithm by which many
randomized networks can be produced from the original
network (see, e.g., Milo et al. [8] for a review and comparison
of several such algorithms). While using an ensemble
generated from the actual network often preserves
features of the network that are desired for fair compar- 
ison (e.g. the degree distribution), this method may also 
induce unintended correlations in subgraph counts that 
ultimately influence the labeling of subgraphs as motifs. 
The purpose of this note is to demonstrate and interpret 
such correlations in a simple case and describe how mu- 
tual information may be used to identify such correlations 
in general. 
FIG. 1: All possible 3-node subgraphs, labeled as in Alon et
al.'s "motif dictionary" [9].



METHODS 

Following Milo et al. [7], we perform 3- and 4-node motif
detection on the transcriptional regulatory networks
of E. coli (version 1.1) and S. cerevisiae, using their freely
available network data and software (mfinder version
1.2).

Generation of randomized networks from the actual 
network is performed according to one of three null mod- 
els: an edge-swapping algorithm, an edge-matching al- 
gorithm, and a Monte Carlo algorithm, all described 
in detail in [8]. Because significance results are simi- 
lar among models (cf. Results and [5]), emphasis in this 
note is placed on the edge-swapping algorithm, a Markov 
Chain procedure that repeatedly swaps the target nodes 
between pairs of edges. Z-scores are computed from the
mean and standard deviation of the count of a particular
subgraph within an ensemble of at least 1,000 randomized
networks [7].
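
As a minimal sketch of the edge-swapping procedure and the Z-score calculation described above, the following Python code may help fix ideas. It is not the mfinder implementation: the function names, the rejection of swaps that would create self-loops or duplicate edges, and the use of the population standard deviation are assumptions made for illustration.

```python
import random
import statistics


def edge_swap_randomize(edges, n_swaps, seed=None):
    """Randomize a directed graph by swapping the targets of edge pairs.

    Each accepted swap turns (a -> b, c -> d) into (a -> d, c -> b), so every
    node keeps its in-degree and out-degree. Swaps that would create a
    self-loop or a duplicate edge are rejected (an assumed convention).
    The input is assumed to contain no duplicate edges.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def z_score(real_count, random_counts):
    """Z-score of a subgraph count against counts from a randomized ensemble.

    Uses the population standard deviation (an assumption); raises
    ZeroDivisionError if all ensemble counts are identical.
    """
    mean = statistics.mean(random_counts)
    std = statistics.pstdev(random_counts)
    return (real_count - mean) / std
```

Because each accepted swap only exchanges targets, the degree sequence of the original network is preserved, which is the feature of this null model noted in the introduction.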

We quantify correlation between the counts of any
two subgraphs over the course of a randomization process
using mutual information [10]. Mutual information
captures correlation between two random variables even
when a relationship exists that is nonlinear (unlike, e.g.,
the correlation coefficient) or non-monotonic (unlike, e.g.,
Spearman's rho). In this study, the counts $n_i$ and $n_j$ of
the ith and jth subgraphs at each iteration of the edge-swapping
process are used to increment a counts matrix
from which the joint probability distribution $p(n_i, n_j)$ is
obtained by normalization. Mutual information is
computed as

$$I_{ij} = \sum_{n_i, n_j} p(n_i, n_j) \log_2 \frac{p(n_i, n_j)}{p(n_i)\, p(n_j)}, \qquad (1)$$

where the log is base 2 to give $I_{ij}$ in bits, and
$p(n_i) = \sum_{n_j} p(n_i, n_j)$ and $p(n_j) = \sum_{n_i} p(n_i, n_j)$
are the marginal distributions.

Mutual information is bounded from below by 0
(as seen in Eqn. 1 when there is no correlation and
the subgraph counts are independent of each other, i.e.
$p(n_i, n_j) = p(n_i)\, p(n_j)$) and bounded from above by the
smaller of the two variables' entropies, where

$$H_i = -\sum_{n_i} p(n_i) \log_2 p(n_i) \qquad (2)$$

and the analogous expression with $i \to j$ are the entropies
of the ith and jth subgraphs' counts respectively. In
order to obtain a statistic that can be compared across
all subgraph pairs, we normalize by the average entropy,
defining

$$\alpha_{ij} = \frac{I_{ij}}{(H_i + H_j)/2} \qquad (3)$$

as our measure of correlation. Note that $0 \le \alpha_{ij} \le 1$
and $\alpha_{ij} = 1$ when $i = j$.

We find qualitatively similar results (cf. Results) when
normalizing by the minimum, instead of the average, entropy.
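
The normalized mutual information between two count series can be sketched in a few lines of Python. This is an illustrative reading of the definitions above rather than the authors' code: the function names, the `normalize_by` parameter, and the convention of returning 0 when the reference entropy vanishes are assumptions.

```python
from collections import Counter
from math import log2


def entropy(prob):
    """Shannon entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in prob.values() if p > 0)


def normalized_mutual_information(counts_i, counts_j, normalize_by="average"):
    """Mutual information of two count series divided by a reference entropy.

    counts_i[t] and counts_j[t] are the counts of subgraphs i and j at
    iteration t of one randomization run.
    """
    if len(counts_i) != len(counts_j):
        raise ValueError("count series must have equal length")
    total = len(counts_i)
    # Joint and marginal distributions obtained by normalizing tallies.
    joint = {k: v / total for k, v in Counter(zip(counts_i, counts_j)).items()}
    p_i = {k: v / total for k, v in Counter(counts_i).items()}
    p_j = {k: v / total for k, v in Counter(counts_j).items()}

    mutual_info = sum(
        p * log2(p / (p_i[a] * p_j[b])) for (a, b), p in joint.items()
    )
    h_i, h_j = entropy(p_i), entropy(p_j)
    reference = min(h_i, h_j) if normalize_by == "minimum" else (h_i + h_j) / 2
    if reference == 0:
        return 0.0  # reference entropy vanishes; returning 0 is an assumed convention
    return mutual_info / reference
```

Two series related by any one-to-one map, including a perfectly anticorrelated pair, give a value of 1 under this measure, while statistically independent series give 0.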
RESULTS

An interpretable correlation

Only four 3-node subgraphs are present in the transcriptional
network of E. coli, and a Z-score analysis of
the type performed in [7] reveals a curious effect. Specifically,
with respect to ensembles generated via any of the
edge-swapping, edge-matching, and Monte Carlo algorithms
[8], the Z-scores of three of the subgraphs (IDs 6,
12, and 36; cf. Fig. 1) are either very close or equal to the
negative of the Z-score of the fourth subgraph (the feed-forward
loop, ID 38); see Fig. 2. In fact, as shown in Fig.
3, the absolute value of the difference between the count within the
actual network and the count within a sample randomized
network at each iteration of the edge-swapping algorithm
is the same among all four subgraphs for the first 1,000 iterations.
The interpretation is simple: as detailed in Fig.
4, each time an edge of a feed-forward loop is swapped
with an external edge, the feed-forward loop is destroyed
and one of each of the other three subgraphs is created;
using subgraph IDs we may denote this process as

$$38 \to 6,\ 12,\ 36. \qquad (4)$$

Since this process accounts for the overwhelming major- 
ity of the changes in count of the latter three subgraphs, 
there is extremely high correlation among the counts of 
all four subgraphs in each randomized network, and the 
magnitudes of their Z-scores are very close. 
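
The conversion of one feed-forward loop into three two-edge subgraphs can be checked on a toy network consisting of a single feed-forward loop plus one disjoint external edge. The sketch below is illustrative only: node names and function names are hypothetical, subgraphs are identified by descriptive names rather than by dictionary IDs, and the three two-edge types (fan-out, fan-in, chain) are assumed to correspond to IDs 6, 12, and 36 without asserting which name maps to which ID.

```python
from collections import Counter
from itertools import combinations


def classify_triad(nodes, edges):
    """Name the connected 3-node subgraph induced on `nodes`, or return None."""
    induced = [(u, v) for u in nodes for v in nodes if u != v and (u, v) in edges]
    if any((v, u) in edges for u, v in induced):
        return "other"  # mutual edges do not occur in this toy example
    if len(induced) < 2:
        return None  # fewer than two edges cannot connect three nodes
    out_degree = Counter(u for u, _ in induced)
    in_degree = Counter(v for _, v in induced)
    if len(induced) == 3:
        return "cycle" if max(out_degree.values()) == 1 else "feed-forward loop"
    if max(out_degree.values()) == 2:
        return "fan-out"
    if max(in_degree.values()) == 2:
        return "fan-in"
    return "chain"


def count_triads(edges):
    """Count connected induced 3-node subgraphs by type."""
    edges = set(edges)
    nodes = sorted({n for edge in edges for n in edge})
    labels = (classify_triad(t, edges) for t in combinations(nodes, 3))
    return Counter(label for label in labels if label is not None)


def swap_targets(edges, e1, e2):
    """Swap the target nodes of edges e1 and e2."""
    (a, b), (c, d) = e1, e2
    return (set(edges) - {e1, e2}) | {(a, d), (c, b)}


ffl = {("x", "y"), ("x", "z"), ("y", "z")}  # feed-forward loop (ID 38)
external = ("u", "v")                       # edge outside the loop
graph = ffl | {external}

before = count_triads(graph)
changes = []
for loop_edge in sorted(ffl):
    after = count_triads(swap_targets(graph, loop_edge, external))
    labels = sorted(set(before) | set(after))
    changes.append({label: after[label] - before[label] for label in labels})
    print(loop_edge, changes[-1])
```

For each of the three possible swaps, the feed-forward loop count drops by one while the fan-out, fan-in, and chain counts each rise by one, in line with the process described above. In a real network the surrounding edges can contribute further changes, which this isolated example does not capture.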
FIG. 2: The four 3-node subgraphs that appear in the E. coli
network. Z-scores are calculated with respect to an ensemble
of 1,000 randomized networks, generated via the edge-swapping
algorithm [5]. Plots show the count of each subgraph
during the generation of one randomized network. Each
iteration corresponds to one edge-swap.
FIG. 3: Absolute value of the difference between count $n_0$ in
the actual network and count $n$ at each iteration of the edge-swapping
algorithm, for the subgraphs in Fig. 2. Note that
all four curves completely overlap.
FIG. 4: Illustration of a correlation-producing effect. Panel
A shows subgraph 38 (the feed-forward loop) and an external 
edge, between which there are three possible edge swaps: a 
swap of edges 3 and 4 (panel B), a swap of edges 2 and 4 
(panel C), and a swap of edges 1 and 4 (panel D). In all three 
cases B, C, and D, subgraph 38 is reduced by one count, and 
subgraphs 6, 12, and 36 are each increased by one count. 



An information-theoretic tool 

To quantify and extend the detection of correlations 
such as that just described, we use a normalized mutual 
information measure, as detailed in Methods. For the 
cases of 3- and 4-node subgraphs in both the E. coli and 
S. cerevisiae transcriptional networks, the measure $\alpha_{ij}$
(cf. Eqn. 3) is computed between all pairs of subgraphs i
and j that appear during the randomization of a network
via the edge-swapping algorithm. Figs. 5 and 6 show the matrices
$\alpha_{ij}$; the row and column order is determined by
summing along either direction and sorting, which tends
to group together sets of subgraphs with high pairwise
correlations.
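
The ordering step can be sketched as follows. The function name, the descending sort direction, and the example matrix with placeholder labels are assumptions for illustration and do not reproduce any values from the figures.

```python
def order_by_row_sum(matrix, labels):
    """Sort rows and columns of a symmetric matrix by descending row sum."""
    sums = [sum(row) for row in matrix]
    order = sorted(range(len(labels)), key=lambda k: sums[k], reverse=True)
    reordered = [[matrix[r][c] for c in order] for r in order]
    return reordered, [labels[k] for k in order]


# Hypothetical symmetric correlation matrix for four placeholder subgraphs.
example = [
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.1],
    [0.9, 0.2, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
sorted_matrix, sorted_labels = order_by_row_sum(example, ["A", "B", "C", "D"])
print(sorted_labels)
```

Since the matrix is symmetric, row sums and column sums coincide, so one permutation applied to both axes suffices. In this example the strongly correlated pair A and C ends up adjacent, forming a bright block of the kind seen in the figures.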

During randomization of the E. coli network, a set of
four 3-node subgraphs (IDs 6, 12, 36, and 38; cf. Fig.
1) is highly correlated, as shown by the bright 4-by-4
square in Fig. 5A. The high correlation is simply the
result of the effect described in the previous section, in
which any of three swaps overwhelmingly converts a feed-forward
loop (ID 38) into three other subgraphs (IDs 6,
12, and 36). In fact the same set of high correlations is
seen during the randomization of the S. cerevisiae network,
as shown by the upper left 4-by-4 square in Fig.
5B. There are additional correlated sets in S. cerevisiae:
subgraphs 14, 74, and 102 are highly correlated as indicated
by the bright 3-by-3 square involving these IDs in
Fig. 5B, and subgraphs 74 and 108, as well as 14 and
46, are correlated as indicated by the relatively bright
entries at these coordinate pairs in Fig. 5B. Respectively,
these correlations are due to the effects (in the notation
of Eqn. 4)



$$102 \to 12,\ 14,\ 74, \qquad (5)$$
$$108 \to 6,\ 74,\ 74, \qquad (6)$$
$$46 \to 14,\ 14,\ 36, \qquad (7)$$

of which one may convince oneself with the aid of Fig. 1.
Note that although subgraphs 14, 102 and 108 partici- 
pate in the highly correlated effects described here, none 
changes in number significantly enough upon randomiza- 
tion to be labeled a motif in the S. cerevisiae network 
(subgraphs 46 and 74 do not appear in the actual network,
only during the course of the randomization).

Our analysis reveals correlations between counts of 4- 
node subgraphs as well. As indicated by the bright blocks 
and off-diagonal elements in Fig. 6, several sets of subgraphs
are highly correlated during the randomization of
both the E. coli and S. cerevisiae networks. Correlations 
are less easily interpreted in the 4-node case than in the 
3-node case, but one must nonetheless remain aware of 
such artifacts of the randomization process when identi- 
fying subgraphs as motifs. We note that the bi-fan (ID 
204), the 4-node subgraph commonly identified as a mo- 
tif in a variety of networks including both transcriptional 
networks studied here [7], does not exhibit particularly
high correlation with any other subgraph under our mea- 
sure in either the E. coli or S. cerevisiae network. 

We find results qualitatively similar to Figs. 5 and 6 when
normalizing by the minimum, instead of the average, entropy
in Eqn. 3. The technique we describe here can be
extended to the detection of subgraphs of any size. 

DISCUSSION 

By quantifying correlations among subgraph counts 
during 3- and 4-node motif detection in the transcrip- 
tional networks of E. coli and S. cerevisiae, we reveal 
that motifs come in sets: the destruction of a subgraph 
during the randomization process can be highly corre- 
lated with the creation of one or more other subgraphs. 
The correlations are easily understood in the 3-node case, 
and we present an information-theoretic tool to extract 
such correlations in general. It has not escaped our atten- 
tion that this observation serves as the basis for a more 
principled clustering of subgraphs based on correlations 
(e.g., by mixture-modeling in which the state of the subgraph
count is a mixture of several states, with counts
conditionally independent given the state).

FIG. 5: Correlation measure $\alpha_{ij}$ (cf. Eqn. 3) between all pairs
of 3-node subgraphs i and j that appear during the randomization
of a network via the edge-swapping algorithm [8] for
the transcriptional networks of E. coli (A) and S. cerevisiae
(B). Subgraphs are labeled as in Fig. 1. The row and column
order is determined by summing along either direction and
sorting.

FIG. 6: Correlation measure $\alpha_{ij}$ (cf. Eqn. 3) between all pairs
of 4-node subgraphs i and j that appear during the randomization
of a network via the edge-swapping algorithm [8] for
the transcriptional networks of E. coli (A) and S. cerevisiae
(B). Subgraphs are labeled as in Alon et al.'s "motif dictionary"
[9]. The row and column order is determined by summing
along either direction and sorting.

The correlations among subgraphs are artifacts of the 
algorithm used to generate the ensemble of randomized 
networks; although we demonstrate their existence here 
in the context of only one randomization algorithm, the 
edge-swapping algorithm, they occur in other commonly 
used algorithms, as evidenced by mutually consistent ef- 
fects on the Z-scores. These findings do not necessarily 
invalidate the statuses of commonly identified motifs (it 
remains the case, for example, that there are significantly 
more feed-forward loops in the transcriptional network of 
E. coli than in a random network generated under almost
any commonly used null model); they do argue, however,
that the limitations of the randomization scheme should 
be fully recognized during the motif finding process. 